ABSTRACT

Seven experiments were conducted concerning decision 
making and information processing under conditions of uncertainty. 
Several different experimental tasks were used; all presented the 
subject with multiple independent sources of information regarding 
the likelihood that some event would occur. Study 1 subjects were Air 
Force pilots; all other subjects were undergraduate college students. 
The independent variables included the number of inputs, the inputs' 
reliability, the effects of discrepant inputs, the format of the 
problem, and the time available to respond. All tasks and subsequent 
analyses were non-Bayesian, yielding both normative and idiographic
information. The subjects' basic instruction was to indicate the
event most likely to occur or to estimate the likelihood of a given 
event. Results indicated no significant differences between F-16 
pilots and student pilots in their use of averaging as the 
predominant strategy chosen. Simpler strategies were adopted when the 
time allocated for the task was reduced. The consistent use of a
strategy was disrupted when the subject's first exposure to the
task was under time restrictions. When postdecision feedback was
unreliable, the consistency of subsequent decision-making patterns 
was disrupted. The effects of information reliability were equivocal. 
Unreliable information was sometimes incorporated into the 
decision-making processes and sometimes ignored. (Author/YLB) 
SUMMARY



Objective 

The objectives were (a) to identify the types of decision-making strategies used by college students and Air Force
pilots when processing probabilistic information, (b) to relate the types of strategies to the reliability, probabilistic
distribution, and format of the information available for basing a decision, (c) to ascertain the effects of differentially
reliable feedback on information selection strategies, (d) to assess the impact of time limitations on strategy choice and
accuracy, and (e) to develop new experimental paradigms that model different aspects of decision making and information
processing under a wide variety of uncertainty conditions.

Background/Rationale 

Little is known about how people process information, form strategies, and make decisions in situations containing 
unreliable, contradictory, or uncertain information. Many in-flight piloting situations are of this type, particularly in 
tactical situations. Results from previous decision-making research suggest that fighter pilots tend to adopt strategies 
similar to college students, but pilots appear to be more consistent in their use of a given strategy. The present effort 
explores the apparent difference in decision-making consistency between experienced pilots and college students. 

Approach 

Seven separate experiments were conducted in which some aspect of the information available to the subject for 
decision making was varied. Experiment 1 investigated the differences between separate groups of Air Force pilots at 
the beginning, middle, or near completion of Undergraduate Pilot Training (UPT) on a task requiring them to indicate 
which of two events was more likely to occur based upon independent probability estimates of the events. The data from 
the UPT pilots were compared with data collected previously from F-16 and F-15 pilots. The remaining experiments
were conducted at Bowling Green State University using college students as subjects. Each experimental task was
designed to model selected aspects of real-world decision making such as the time available for deciding, information
reliability, the presentation format, and feedback consistency.

Specifics 

Method. Experiment 1 was conducted at Williams AFB using three groups of UPT trainees (28 beginning, 27 
intermediate, 21 advanced) on a task requiring the subject to indicate which of two events (A or B) they thought would 
be the most likely to occur based on several separate (i.e., independent) probability estimates for each event. The number
of pieces of information was either 3, 5, or 7. The task was self-paced. The candidate strategies were averaging, adding,
largest cue, and most cues. Experiment 2 used the same paradigm but varied the reliability of the probability estimates
by designating each as of high, medium, or low reliability. Additionally, the arrays contained outlier estimates.
Nineteen college students participated in this effort. Experiment 3, using a similar paradigm, manipulated the time 
available to make a decision. The effects of time restrictions were compared to self-paced decision making. Twenty-two 
college students served as subjects. Experiment 4 adopted a different paradigm by asking the subjects to make an estimate 
of the overall likelihood of occurrence. Four array variables were manipulated: (a) the probabilistic distance between
outlying and clustered estimates, (b) the direction of outlying estimates relative to the clusters, (c) the density of the 
clusters, and (d) the symmetry of the array. The task was self-paced using 20 college students as subjects. Experiment 
5 varied the format of an array of estimates, presenting it as a histogram, a list, or a geometric numeric display. The task was
to indicate the average value of the array. The cue sets had either three or five sources of information, and the presentation
times were either 3, 6, or 9 seconds. Fifteen college students served as subjects. Experiments 6 and 7 used an entirely
different experimental task in which the subject was to decide which of two diseases was present. The subject could 
ask for information that would aid in making a medical diagnosis. The information was arranged in such a fashion as 
to be either valuable or not in certain combinations with other information. Both experiments used nursing students as 
subjects (29 in Experiment 6 and 13 in Experiment 7). In both efforts the reliability of the feedback was varied according
to predetermined rules. 

Findings and Discussion. The results indicated that all the groups of pilots tended to use the averaging strategy
most often and that they did not differ in their consistency of strategy use. Their response patterns were similar to
previously collected data on F-15 and F-16 pilots and college students. Results were equivocal with respect to the effects
of reliability because it was unclear that the subjects understood the meaning of "reliability." Under certain
circumstances the subjects behaved as predicted (e.g., higher weighting given to information of high reliability), and
in other circumstances they tended to disregard differences in reliability. The effects of time restrictions were to increase
error and to increase the use of simpler strategies (e.g., cue sum rather than cue average). The disruptive effects of time
restrictions can be prevented by allowing the subject initially to self-pace. For the kinds of tasks used in this effort,
the geometric numeric format results in significantly less subject processing error than histograms or lists. There were
strong individual preferences between format types. The data using the diagnostic paradigm indicated that the subjects
tend to choose diagnostically worthless information and continue to do so under a variety of feedback conditions.



Conclusions/Recommendations

Based on results from the time limitations experiments, it can be recommended that initial training should be 
accomplished in a self-paced situation even if the criterion environment will be time limited. Additional research using
different experimental tasks needs to be conducted in order to explicate the role of information reliability and consistency.
The results from the display format study indicate that display format does make a difference in accuracy and that 
individuals differ significantly in their ability to use different formats. Therefore, careful consideration should be given 
to the display in the design phase. The results from the diagnostic experiments indicate that people persist in choosing 
inappropriate strategies when searching for information. Therefore, training for situations in which a pilot may need to 
make a "diagnosis" and has several alternative information sources should include the logic of search. Although the 
series of experiments conducted in this effort succeeded in providing new methods for studying analogues of real world 
situations, future research should concentrate on still broader paradigms modeled after more specific application areas. 
I. INTRODUCTION
Problem Area 

A fundamental attribute of humans is the ability to process multiple
sources of stimulation and develop a single response. One of the many
ways of conceptualizing this process has involved postulating that the
environmental information, the "inputs," are represented in the cognitive
system as discrete probabilities or as probability distributions.
In psychological research such as probability learning, the investigator
presents information to a person in some form other than probabilities
and analyzes the data as though the person had encoded the information
into probabilistic form and then processed those probabilities using
some algorithm, normative or otherwise.

There is also a substantial number of studies in which the experimental
data are presented in essentially probabilistic form and subjects
are required to respond with probabilities. This technique, the
so-called Bayesian aggregation paradigm, was thoroughly explored in a
widely cited article by Peterson and Beach (1967). This model assumes
that people revise their opinions of the probability (P) of some event
(E) given some data (D) by taking into account the probabilities of the
data given the possible events. This is represented by conditional
probabilities P(D/E1), P(D/E2), ... which are read as "the probability of
the data given Event 1, the probability of the data given Event 2," etc.
Under the assumptions of this model, a pilot who observes two different
indicators of a malfunction and who must make a decision must first
retrieve the following information from memory:
P (Accident prior to the Anomalous Readings)
P (Indicator Reading 1, given an Accident)
P (Indicator Reading 1, given no Accident)
P (Indicator Reading 2, given an Accident)
P (Indicator Reading 2, given no Accident)

for each indicator. These probabilities are then aggregated into an
estimate of

P (Accident, given both Sets of Readings).
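
The aggregation that the Bayesian model calls for can be sketched in a few lines. This is only an illustration: it assumes the two indicator readings are conditionally independent given the accident state, the function name is hypothetical, and the numerical values are invented for the example rather than taken from this report.

```python
def posterior_accident(prior, p_r1_acc, p_r1_no, p_r2_acc, p_r2_no):
    """Bayes' rule for two readings assumed conditionally independent.

    prior     : P(Accident) before the anomalous readings
    p_rK_acc  : P(Indicator Reading K, given an Accident)
    p_rK_no   : P(Indicator Reading K, given no Accident)
    """
    joint_accident = prior * p_r1_acc * p_r2_acc
    joint_no_accident = (1 - prior) * p_r1_no * p_r2_no
    return joint_accident / (joint_accident + joint_no_accident)

# Invented values for illustration only.
p = posterior_accident(prior=0.01, p_r1_acc=0.8, p_r1_no=0.1,
                       p_r2_acc=0.6, p_r2_no=0.2)
print(round(p, 3))  # 0.195
```

Even in this small case the decision maker must hold five separate quantities in mind, which is part of why the model is doubtful as a description of what people actually do.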

Normative statistical theory, specifically Bayesian theory, processes
information in this manner; the problem is that people do not.
The consensus of the scientific community on this issue seems
to be that, except for primitive sensory and perceptual processes,
people do not process data in the fashion described. Nevertheless, it
is the authors' belief that people process data using probabilistic
representation, or at least many people do. How they do it is the
important question.

The authors' belief, substantiated by preference data, is that people
do not aggregate multiple sources of data by combining P(D/E), that is,
the probability of data given the event. The belief, rather, is that
people, contrary to normative models, aggregate, in a statistical sense,
the wrong data. People aggregate multiple sources of information by
aggregating P(E/D) values; that is, the probabilities of events given
the data.

A pilot might take the probability of an accident given a low altitude
reading and integrate it somehow with the probability of an accident
given an additional oil pressure problem. The "somehow" is the
point of this research.

A researcher could provide data to people in some non-probabilistic
form and then theorize as if the people converted the information to P
values. Conversely, the researcher could begin with simpler investigations
and present P values directly. The strategy adopted in these
first studies investigates how people aggregate these values when given
P(E/D) values themselves (i.e., numbers) from sources which conflict.
Unfortunately, there is no normative model that says how people ought to
do this.

II. RESEARCH EFFORT 

Most important cognitive tasks require that multiple sources of
information be processed into single responses. Quantities of information,
ranging from a few inputs to enormous amounts of data, are
reduced to a yes-no, a button push, a turn of a wheel, or a simple 
directive. This process of data reduction has been studied in concept 
formation, problem solving, information integration, and information 
processing. Work within these experimental paradigms is furthering the 
understanding of this fundamental process of information reduction. 

In 1978, Jones, Schipper, and Holzworth (JSH) introduced a new para- 
digm for investigating decision-making and information aggregation, 
which, unlike apparently related paradigms, presents decision-makers 
directly with estimates of the likelihoods of events of interest. As 
noted earlier, this is considered more representative of decision making 
in the real world. This new paradigm provides for both idiographic and 
group analyses. 

Consider the following arrangement. An observer has several independent
sources of information concerning the occurrence of some event.
The same observer has several independent and different sources of
information concerning the occurrence of a second event. The observer's
task is, first, to consider those sources of information concerned with
the first event, second, to consider those sources of information concerned
with the second event, and then to make a decision as to which of
the two events is more likely to occur.

Example 1. A hypothetical illustration of this type of task might
be a situation in which one source of information says simply that the
probability of war in the Middle East is .30. A second source of information
says that the probability of war in the Middle East is .40. A
third source of information says that the probability of war in the
Middle East is .60. A fourth source of information says that the probability
of war in Central Africa is .20. A fifth source of information
says that the probability of war in Central Africa is .70. The observer
must decide in which area war is more likely to occur.

Example 2. An indicator device provides information that the probability
of a faulty landing system is .30. A second independent indicator
says that the probability of a faulty landing system is .50. A
third source of information says that the probability of improper left
engine spool-up is .20, and a fourth system status readout shows the
probability of an improper left engine spool-up to be .60. Which is the
more likely event: landing system failure or engine trouble?
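
To see how candidate aggregation rules can disagree on such items, the two examples can be run through the four strategies named in this report (Adding, Averaging, Largest Cue, Most Cues). The sketch below is illustrative only: the function name is hypothetical, and the operational definition of each strategy is an assumption inferred from its name (sum of the cues, mean of the cues, single largest cue, and number of cues, respectively).

```python
from fractions import Fraction

def predict(cues_a, cues_b):
    """Return each candidate strategy's choice: 'A', 'B', or 'tie'.

    Cues are probabilities in hundredths (30 means .30) so that the
    arithmetic stays exact. Strategy definitions are assumptions.
    """
    def pick(score_a, score_b):
        if score_a > score_b:
            return "A"
        if score_b > score_a:
            return "B"
        return "tie"

    return {
        "Adding": pick(sum(cues_a), sum(cues_b)),
        "Averaging": pick(Fraction(sum(cues_a), len(cues_a)),
                          Fraction(sum(cues_b), len(cues_b))),
        "Largest Cue": pick(max(cues_a), max(cues_b)),
        "Most Cues": pick(len(cues_a), len(cues_b)),
    }

# Example 1: A = war in the Middle East, B = war in Central Africa
example_1 = predict([30, 40, 60], [20, 70])
# Example 2: A = faulty landing system, B = improper left engine spool-up
example_2 = predict([30, 50], [20, 60])
print(example_1)
print(example_2)
```

Under these assumed definitions, Example 1 splits the strategies two against two (Adding and Most Cues favor the Middle East; Averaging and Largest Cue favor Central Africa), while in Example 2 only Largest Cue separates the events and the other three rules tie.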

The same type of format arises in many decision-making situations
in which information is presented concerning one or more possible
forthcoming events.

Subjects in studies 1, 2 and 3, as well as in certain studies 
accomplished before contract initiation, used the JSH paradigm. This 
paradigm involves presenting subjects with a large number of arrays, 
each carrying information from one or more sources about the probability 
that event A will occur and information from two or more other sources 
about the probability that event B will occur. Figure 1 gives three
examples of what an array would look like to the subject.

The JSH paradigm is distinguished from apparently similar paradigms
in important ways. First, the information is presented numerically,
rather than as substantive variables or as diagnostic events. Second,
the probabilities are already P(E/D) values, but ones which do not agree
with one another. Finally, the analysis relies on obtaining many
responses from each subject, then comparing the subject's set of responses
with various sets of predicted responses, each set having been predicted
based on alternative possible process models. In this way,
certain possible descriptive models can be ruled out for given subjects.
The conclusions about the process model used are only tentative since
other, untested process models may make predictions similar to the ones
most consistent with a subject's responses. In other words, some models
are ruled out, others are supported, but none can be proved. The analytical
procedure is outlined in Figure 1 and described fully in the original
JSH paper. The theoretical strategies against which each
subject's data were compared were chosen to represent the ones that subjects
are likely to use.

III. Study 1. A Comparison of Information Processing 



In 1979 and 1980, an inventory similar to that of JSH was used at 
Williams AFB to obtain data from pilots beginning training to fly the 
F-16 and from some of their instructor pilots. This inventory consisted 
of 194 items with the numbers of sources of information (cues) ranging 
from three through seven. 

               Panel 1    Panel 2    Panel 3
Averaging         A          B          B
Largest Cue       A          A          B
Most Cues         B          A          A



In the first panel, five probabilistic cues are shown: two
indicate that the probabilities of Event A are .40 and .70,
while three indicate that the probabilities of Event B are
.20, .30, and .50. The choice of Event A is consistent with
the Adding, Averaging, and Largest Cue strategies but essentially
rules out a Most Cues strategy. Given a subset of
items like item 1, consistent choice of Event A allows the
investigator to rule out the possibility that an observer
was using a Most Cues strategy. Consistent choice of Event
B would rule out Adding, Averaging, and Largest Cue. The
items shown in panels 2 and 3 provide other arrangements and
show how evidence can be amassed for, and even more strongly
against, particular strategies.

Panel 2 shows an item in which the choice of B is consistent
with Averaging but may also be consistent with other unspecified
models. The choice of B, however, clearly rules out
Adding, Largest Cue, and Most Cues strategies if such choice
behavior is sufficiently consistent. Panel 3 shows an arrangement
in which each choice would be consistent with two
strategies and inconsistent with two other strategies.



Figure 1. Three sample items and descriptions of predictions 
from four possible strategies. 
In September 1981, data using the same inventory were obtained from
76 Undergraduate Pilot Trainees (UPTs) at Williams AFB. This sample of
trainees was made up of, roughly, 37% Beginning UPTs (UPT-B), 36%
Intermediate UPTs (UPT-I), and 28% Advanced UPTs (UPT-A), where the
three classifications refer to the length of time the trainees had been
in UPT. This part of the effort compared the information processing
and decision making characteristics of a sample from the UPT
population (with relatively short training in the Air Force) with an
F-16 transitioning group (with relatively long training in the Air
Force).

Figure 1 summarizes the types of items included in the inventory. 
A complete description of item selection is given in JSH. The experi- 
mental items were problems similar to those shown in Figure 1, arranged 
randomly, one to a page, bound in a looseleaf, three-ring binder. Four 
different random arrangements of the sets of problems were used. The 
pilots received a printed set of instructions and completed the problems 
at their own pace. Responses were made on mark sense answer sheets. 
Pilots were administered the problem set individually or in small groups 
of not more than four members. Communication within groups and between 
successive respondents seemed to be minimal. 

Table 1 presents a summary of analyses of the 37 F-16 pilots using 
3, 5, and 7 cues for making decisions. Entries in Table 1 for each 
pilot represent scores which have minimum values of 0 and maximum values 
of 100. These entries show the differences between a pilot's responses 
and those that would have been 100% consistent with the strategies 
designated at the top of the table. Hence a large value is a virtual 
guarantee that the named strategy was not what the subject was thinking 
about. 
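
The index described above can be illustrated with a short sketch. It is not the JSH scoring procedure itself: the strategy definitions are assumptions inferred from the strategy names, the handling of tied predictions is an assumption, and the four items and the response pattern are invented for the example.

```python
from fractions import Fraction

# Assumed scoring rule for each strategy; cues are in hundredths (40 means .40).
STRATEGIES = {
    "Adding": lambda cues: sum(cues),
    "Averaging": lambda cues: Fraction(sum(cues), len(cues)),
    "Largest Cue": lambda cues: max(cues),
    "Most Cues": lambda cues: len(cues),
}

def inconsistency_index(items, choices):
    """Percentage of choices that disagree with each strategy's prediction.

    items   : list of (cues_a, cues_b) pairs
    choices : list of 'A' or 'B', one per item
    Items on which a strategy predicts a tie are skipped for that strategy
    (an assumption made here, not a rule taken from the report).
    """
    index = {}
    for name, score in STRATEGIES.items():
        scored = misses = 0
        for (cues_a, cues_b), choice in zip(items, choices):
            a, b = score(cues_a), score(cues_b)
            if a == b:
                continue
            scored += 1
            predicted = "A" if a > b else "B"
            misses += choice != predicted
        index[name] = round(100 * misses / scored) if scored else None
    return index

# Invented items and a respondent who always follows the higher average.
items = [
    ([40, 70], [20, 30, 50]),
    ([30, 40, 60], [20, 70]),
    ([20, 30, 40], [50, 60]),
    ([50, 60], [10, 10, 70]),
]
averager = ["A", "B", "B", "A"]
print(inconsistency_index(items, averager))
```

For this invented respondent the Averaging index is 0 and the Most Cues index is 100, while Adding and Largest Cue each score 25 because they happen to agree with Averaging on three of the four items. That overlap is why a low index supports a strategy only tentatively, whereas a high index rules it out.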
Table 1. Indices of Strategy Use
for 37 F-16 Pilots in Study 1

Table entries are percentages of pilots' choice responses that were inconsistent with the
strategy designated. (a) Indices are presented for three, five, and seven cues and for the
mean of those treatments (x).

            Adding              Averaging            Largest Cue            Most Cues
Pilot    3    5    7    x      3    5    7    x      3    5    7    x      3    5    7    x

    1   71   62   41   58      6    2
    2   45   40   33   39     39   28
    3   58   35   31   48     19    9
    4    0    0    3    1     77   64
    5   77   64   46   62      0    0
    6   71   64   62   66      6    4
    7   42   51   36   43     35   13
    8   65   62   46   58     13    2
    9   55   38   36   43     23   26
   10   29   28   21   26     48   36

   11   68   55   38   54     10    9
   12   77   62   49   63      0    2
   13   77   64   51   64      0    0
   14    0    2    3    2     77   62
   15   32   26   18   25     45   43
   16   71   55   41   56      6   17
   17   77   66   54   66      0    2
   18   32   21   31   28     84   55
   19   77   34   49   63      0    0
   20   68   66   54   63     10    2

   21   74   62   49   62      3    2
   22   74   62   46   61      3    2
   23   26   26   13   22     52   51
   24   48   47   38   41     29   17
   25   35   47   33   38     42   17
   26   71   57   49   59      6    6
   27   74   60   46   60      3    4
   28   77   62   49   63      0    2
   29   58   51   41   50     19   11
   30   65   64   54   61     13    9

   31   65   60   49   58     13    4
   32   58   51   36   48     19   13
   33   68   72   56   62     10    2
   34   61   53   38   51     16
   35   71   57   46   58      6
   36   74   57   44   58      3
   37   42   38   36   39     35

 Mean   58   50   39   49     21
  (b)   11   16   14   14     72

a. Note that these values do not reflect the percentage of time a strategy was used, since one choice
was often consistent with more than one strategy. Pilot 5, for example, probably used an
averaging strategy all the time, but made a mistake in one of the seven cue items. The residuals
from the other strategies are not 100% since on some items, other strategies made the same
prediction as the averaging strategy. Thus, the pilots who have all very low values for one
strategy may be inferred to have used that strategy, or some unspecified one that makes the
same predictions. High values permit one to conclude decisively that a particular strategy was
not used.

b. Percentage of pilots for whom the designated strategy had the lowest percentage index, for
each cue level and for the average.

Of particular interest are (a) the overall use of strategies, (b)
the similarities (or, conversely, the differences) among the pilots, and 
(c) whether strategies changed as the numbers of cues varied. Table 2 
presents the same information as Table 1 for each of the UPT groups. 
The summaries in Tables 1 and 2 can be compared directly. 

Statistical analyses show no reliable overall differences among the 
four pilot groups (F-16, UPT-B, UPT-I, UPT-A), but do show highly 
reliable differences among strategy uses. The Averaging strategy is 
used more frequently than are the Adding and Most Cues strategies. The 
Largest Cue strategy is used more frequently than is the Most Cues stra- 
tegy. And the Adding strategy is used more frequently than is the Most 
Cues strategy. 

A reliable interaction of strategy by pilot type shows differential 
use of strategies according to the type of pilot. That is, all pilot 
types showed reliably different preferences for strategies, tending 
heavily to rely on an Averaging strategy, but different pilot types 
tended to use different strategies when Averaging was not used. 

A summary of relative magnitudes of residuals from respective strategy
types can be obtained by ranking these residuals for the individual
pilots. That is, over three, five, and seven cues combined, the four
residuals, one for each strategy, are ranked from 1 through 4. These
ranks, pilot group by pilot group, can then be correlated to assess the
degree of homogeneity within each group. The statistic
describing this degree of homogeneity is the Kendall Coefficient of
Concordance, and these coefficients are shown in Table 3, along with the
11 patterns of ranked residuals generated by the pilots.
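
As an illustration of the statistic, the following sketch computes Kendall's coefficient of concordance from a set of rank patterns using the standard untied-ranks formula, W = 12S / (m^2 (n^3 - n)), where m is the number of pilots, n the number of ranked strategies, and S the sum of squared deviations of the rank totals from their mean. The function name and the four rank patterns are hypothetical and are not the study's data.

```python
def kendall_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    rankings : one list of ranks per pilot, each a permutation of 1..n.
    """
    m = len(rankings)            # number of pilots in the group
    n = len(rankings[0])         # number of ranked strategies
    totals = [sum(r[j] for r in rankings) for j in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical patterns, ordered Adding, Averaging, Largest Cue, Most Cues.
group = [
    [2, 4, 3, 1],
    [2, 4, 3, 1],
    [3, 4, 2, 1],
    [4, 2, 3, 1],
]
print(round(kendall_w(group), 3))  # 0.675
```

W equals 1 when every pilot in a group produces the same pattern and falls toward 0 as the patterns diverge, so a lower coefficient indicates a more heterogeneous group.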

Table 2. Indices of Strategy Use
for 76 UPT Pilots in Study 1

Table entries are percentages of pilots' choice responses
that were inconsistent with the strategy designated. (a)

            Adding              Averaging            Largest Cue            Most Cues
Pilot    3    5    7    x      3    5    7    x      3    5    7    x      3    5    7    x

Table 3. Ranked Residuals

This table shows the ranked mean residuals for the
Adding strategy (column 1 of the Pattern), Averaging
strategy (column 2), Largest Cue strategy (column 3),
and Most Cues strategy (column 4) for each of the
pilot groups. Thus, the first pattern shows the
Averaging strategy (rank 4) to have the lowest residual,
the Largest Cue strategy (rank 3) to have the
next lowest residual, the Adding strategy (rank 2)
to have the next lowest residual, and the Most Cues
strategy (rank 1) to have the highest residual.

Only these 11 patterns of residual ranks were used
among all 113 pilots.

The lower the residual, the stronger the indication
of strategy use.

The entries under each pilot type show the numbers
of pilots who gave that particular pattern.

The Coefficient of Concordance (W) gives a measure
of the degree of homogeneity within each pilot group.

The data can be interpreted in a fairly straightforward way. When
pilots and pilot trainees are asked to aggregate probabilistic information
of the form used in this study, most, by far, average the probabilities.
However, to reiterate, averaging is not necessarily the
"correct," or normative, solution. There is no normative solution in the
domain of probability theory. If odds were presented and the odds were
averaged before converting to P values, very different answers would be
obtained than if the odds were first converted to probabilities, then
averaged. This is, of course, related to the scale properties of the
various methods of encoding uncertainty. There is no particular reason
to believe that either objective or subjective uncertainty, when encoded
as probabilities, is measured on the interval scale necessary for
averaging.
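
The point about odds can be checked with a small worked example. The two odds values below are hypothetical and the helper name is illustrative; exact fractions are used so that the comparison does not depend on rounding.

```python
from fractions import Fraction

def odds_to_p(odds):
    """Convert odds in favor of an event to a probability."""
    return odds / (1 + odds)

# Hypothetical sources: one reports odds of 1:1, the other 9:1.
odds = [Fraction(1, 1), Fraction(9, 1)]

# Average the odds first, then convert to a probability.
p_from_mean_odds = odds_to_p(sum(odds) / len(odds))
# Convert each source to a probability first, then average.
mean_of_ps = sum(odds_to_p(o) for o in odds) / len(odds)

print(p_from_mean_odds, float(p_from_mean_odds))  # 5/6, about .83
print(mean_of_ps, float(mean_of_ps))              # 7/10, i.e. .70
```

The same two sources yield about .83 under one order of operations and .70 under the other, which illustrates why averaging has no special normative standing here.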

The result, that people generally average, is novel only in the 
sense that so far as is known, subjects had not previously been placed 
in this decision situation. It is not surprising, in that averaging 
behavior has been found by Norman Anderson and his students (1981), 
among others, to be a nearly ubiquitous form of information aggregation. 
Perhaps what is most surprising about the data is that against this 
backdrop of averaging behavior, there are several individuals who con- 
sistently used some other form of information aggregation. For example, 
pilots 4 and 14 in Table 1, and pilots 7, 25, 37, and 40 in Table 2 used 
a strategy that is almost perfectly predicted by assuming that they 
simply summed the P(E/D) values. Pilot 24 in Table 2 clearly just went 
with the event with the largest P(E/D) value. It would be most 
interesting to know if these highly systematic differences in infor- 
mation processing in this task generalized to other cognitive tasks. 

It is also clear from Tables 1, 2, and 3 that the intermediate UPT
group was by far the most heterogeneous of the pilot groups. Without
replication, it is difficult to interpret such data, since the difference
may be essentially a cohort effect.

Tables 1 and 2 show that the JSH methodology was reasonably success- 
ful in isolating strategies, though there are clearly subjects whose 
data are not explained by any of the four hypothesized strategies. 
Subject 2 in Table 1 is one such case. Other plausible strategies have 
been considered, and the strategies of still more of the subjects may be 
identified. A prime possibility is a "Most Cues Over .5" strategy. In 
general, although the pilots tended to average these event estimates, 
there are "mavericks" - highly self-consistent, systematic mavericks. 
But overall these data are basically consistent with the university stu- 
dent data of the original JSH paper. 

Study 1 represents that part of the effort which used Air Force 
pilots and pilot trainees as subjects. All subsequent studies were 
carried out in the Department of Psychology laboratories at Bowling 
Green State University, using undergraduate students as experimental 
subjects. 

Two important questions from our earlier work and other research
were as follows:

1. How would information be processed for making decisions if all 
sources of information were not equally reliable? What if one source 
was more or less reliable than the others? Would decision makers 
tend to weight information equally, or would they somehow tend to 
discount certain sources of information and emphasize others? 

2. How do decision makers treat information that seems to be at 
odds with the majority of information they already have? What if one 
source of information seems to be discrepant in comparison with the rest 
of the information? 

This latter question arose from discussion with students, as well as
with F-16 pilots, who seemed to have similar ideas as to how they
reacted to an outlying source of information, or what is called an
outlier.

IV. Study 2. Reliability of Information as a Factor 
in Decision Making 
Figure 2 shows the sets of cues presented to the observers for pro- 
cessing and describes how the experimental variable reliability was 
manipulated. 

The cue sets exemplified in Figure 2 were presented one at a time to 
each subject using a self-paced procedure. The visual presentations, 
now computer generated, were similar to those shown in Figure 1 but now 
paper and pencil stimulus and response materials were eliminated. A 
green on grey video display, 30-cm diagonal measurement, was viewed at a 
distance of approximately 70 cm. Responses were made on a computer 
keyboard with one key indicating the observer's decision that the event 
on the left was more likely to occur and a second key indicating a deci- 
sion in favor of the event on the right. A press of the RETURN key 
recorded everything for that trial and presented the next cue set. 

Several predictions concerning the use of the reliability infor- 
mation were made prior to the experiment (see Figure 2). The underlying 
rationale for the predictions is that subjects will react to the manipu- 
lation of reliability in a rational manner. That is, they will give 
more weight to high reliability sources and less to low reliability 
sources. It is by no means obvious that this will occur, since subjects 
who are approaching the limit of their ability to process and integrate 
probabilistic information may simply ignore considerations of relative 
source reliability. 
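
One way to make the "rational" use of reliability concrete is a reliability-weighted average. The sketch below is only an illustration of that idea: the numerical weights for H, M, and L are hypothetical (the report assigns no numbers to the reliability labels), and the 30-70-80 array is the three-source outlier array described in Figure 2.

```python
from fractions import Fraction

# Hypothetical weights; the report does not give numerical values for H, M, L.
WEIGHTS = {"H": Fraction(2), "M": Fraction(1), "L": Fraction(1, 2)}

def weighted_average(cues):
    """cues: list of (probability in hundredths, reliability label)."""
    total_weight = sum(WEIGHTS[label] for _, label in cues)
    return sum(p * WEIGHTS[label] for p, label in cues) / total_weight

# The 30-70-80 outlier array under three reliability designations.
all_medium = weighted_average([(30, "M"), (70, "M"), (80, "M")])    # as in T1
outlier_high = weighted_average([(30, "H"), (70, "M"), (80, "M")])  # as in T2
outlier_low = weighted_average([(30, "L"), (70, "M"), (80, "M")])   # as in T3
print(float(all_medium), float(outlier_high), float(outlier_low))   # 60.0 52.5 66.0
```

Under these assumed weights, marking the low outlier as highly reliable pulls the aggregate for that event down, and marking it as unreliable pushes the aggregate up. A subject who uses the reliability labels in this way should therefore choose the outlier event less often in the first case and more often in the second.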

Prediction 1. The proportion of choices for T-2 should be smaller 

The above items illustrate one item each from the five treatments with outliers at
the low end of the probability scale. Each treatment had 10 different comparisons to be
judged, each comparison appearing twice, reversing the event, A or B, with which the
outliers were associated.

Outlier arrays had either three or four sources designated. Non-outlier arrays always 
had two sources. The three-source outlier arrays were 30-70-80, as in the example above. 
The four-source arrays were 20-50-60-70. 

In each treatment, the 30-70-80 array was compared with five non-outlier arrays 40-60, 
50-60 # 50-70, 60-70, and 60-80. The 20-50-60-70 array was compared with 30-50, 40-50, 
40-60. 50-60, and 50-70. 

The reliability vaiiable was manipulated across treatments as shown in the examples. 

Tl: all sources were designated M, medium reliability 

T2: the outlier was designated H; highly-reliable, all others M 

T3: the outlier was designated L, low reliability, all others, M 

T4: the second highest source in the outlier array was designated H, all others M 

T5: the second highest source in the outlier array was designated L, all others M 

The critical comparisons in this study are the proportions of times the subject 
chooses the event associated with the outlier array with an H or an L value versus the 
same outlier arrays with all M values, i.e., the proportion of outlier event choices 
in T2 with that in Tl, T3 with Tl, T4 with Tl and T5 with Tl. All such comparisons of choices 
made are against the control set of non-outliers described above, an example of which is 
shown in tne figures as the sources relevant to Event B. 



Figure 2. The Plan of Study 2 
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Figure 2-Continued 
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The above items illustrate one item each from the five treatments with outliers 
at the high end of the probability scale. Except for tne actual values associated 
with the sources of information, all features of the study are the sane as for the outliers 
at the low end of the probability scale. 

The arrays of data associated with the possible events are 
OUTLIER ARRAYS 



20-30-70 



vs. 



20-40 
30-40 
30-50 
40-50 
40-60 



30-40-50-80 



30-50 
40-50 
40-60 
50-60 
50-70 



The reliability variable was manipulated across treatments as follows: 

T6: all sources desiqnated M 
T7: the outlier was designated H 
T8: the outlier was designated L 

T9: the second lowest source in the outlier array was designated H 
T10: the second lowest source in the outlier array was designated L 
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than for T-l. (For definitions of treatments T-l through T-10, see 
Figure 2.) A highly reliable low outlier would be more heavily weighted 
(subjectively) thereby decreasing the average of its cue set. This 
lower average, in comparison with the average of the control set which 
was composed of all medium reliability cues, should be reflected in 
fewer choices of the T-2 set in comparison with those of the control T-l 
set. 

Prediction 2. The proportion of choices for T-3 should be greater
than for T-1. A low outlier of low reliability would be less heavily
weighted, thereby increasing the average of its cue set. This higher
average, in comparison with the average of the control set, which,
again, was composed of all medium reliability cues, should be reflected
in more choices of the T-3 set in comparison with the T-1 set.

Prediction 3. The proportion of choices for T-4 should be greater
than for T-1. A non-outlier of high reliability would be more heavily
weighted, thereby giving a higher average than that for the control set
and more choices of the T-4 set in comparison with the T-1 set.

Prediction 4. The proportion of choices for T-5 should be smaller
than for T-1. A non-outlier of low reliability would be less heavily
weighted, producing a lower average than that of the control set and
fewer choices of the T-5 set compared with the T-1 set.

These four predictions deal with the cue sets containing low
outliers. Four predictions relating to cue sets containing high
outliers are essentially complementary to those four just listed.

Prediction 5. The proportion of choices for T-7 should be greater 
than for T-6. A highly reliable high outlier would be more heavily 
weighted, increasing the average of its cue set in comparison with the 
average of the control set, which was composed of all medium reliability 
cues.

Prediction 6. The proportion of choices for T-8 should be smaller 
than for T-6. A high outlier of low reliability would be less heavily 
weighted; thus, a lower average. 

Prediction 7. The proportion of choices for T-9 should be smaller 
than for T-6. A non-outlier of high reliability would be more heavily 
weighted, thereby reducing the average for T-9. 

Prediction 8. The proportion of choices for T-10 should be greater 
than for T-6. A non-outlier of low reliability would be less heavily 
weighted giving a higher average for T-10. 
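
The direction of each of the eight predictions can be checked with a reliability-weighted average. The sketch below is illustrative only: the numeric weights attached to H, M, and L, and the function name, are assumptions introduced here (the study specifies only that high reliability sources should count for more than low reliability sources), and only the three-source arrays 30-70-80 and 20-30-70 are shown.

```python
# Hypothetical subjective weights for the reliability labels. The report
# gives only the ordering H > M > L, not numeric values.
WEIGHTS = {"H": 2.0, "M": 1.0, "L": 0.5}


def weighted_average(cues, labels):
    """Reliability-weighted mean of cue probabilities (in percent)."""
    total_weight = sum(WEIGHTS[label] for label in labels)
    weighted_sum = sum(WEIGHTS[label] * cue for cue, label in zip(cues, labels))
    return weighted_sum / total_weight


low = [30, 70, 80]   # low-outlier array; the outlier is 30
high = [20, 30, 70]  # high-outlier array; the outlier is 70

# One label per source, in the same order as the cue values.
treatments = {
    "T-1": (low, "MMM"), "T-2": (low, "HMM"), "T-3": (low, "LMM"),
    "T-4": (low, "MHM"), "T-5": (low, "MLM"),
    "T-6": (high, "MMM"), "T-7": (high, "MMH"), "T-8": (high, "MML"),
    "T-9": (high, "MHM"), "T-10": (high, "MLM"),
}

averages = {
    name: weighted_average(cues, labels)
    for name, (cues, labels) in treatments.items()
}

for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under these assumed weights the weighted average falls below the all-M control for T-2, T-5, T-8, and T-9 and rises above it for T-3, T-4, T-7, and T-10, which is the ordering the predictions describe. Any other weights with H > M > L would give the same directions.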

Complete replication of the 100 experimental trials provided an 
evaluation of the reliability of each subject's decision strategies. 
Comparing the first set of 100 trials with the second set of 100 trials 
(in differing random orders) gave reliability indices within the range 
from .68 to .99, with a mean of .87 and a standard deviation of .09. 
Reliability in this context means the proportion of times the same 
events were selected in the corresponding event-pair configurations in 
both replications. 

Table 4 is a summary of the results of this study. Individual sub- 
ject data, along with the pooled information, are presented. 

From the summary of all subjects shown at the top of Table 4 it can
be seen that the direction of difference was confirmed in six of the 
eight predictions. Of these six, only three of the differences are of 
sufficient magnitude to be labeled statistically reliable (p < .05 for a 
Type I error). Of the two predictions that were not confirmed, only one 
of the differences is reliable, that for Prediction 8, and this dif- 
ference is in the direction opposite to that which was predicted. 

Table 4. Overall Summary

These tables show the individual subject preferences and the summary for
all 19 subjects for the respective treatments described in the text. The
maximum number of preferences (choices) for the outlier event was 20.

Of the total of 8 (per subject) times 19 (subjects) = 152 predictions
made on an individual subject basis, a majority of predictions in
terms of direction of difference (80) were confirmed, with 26 being in
the wrong direction and 46 scored as ties. This finding is informative,
not so much for the support of the predictions, but in pointing up a 
shortcoming of the use of these 20 particular event-pair configurations. 
It had been assumed that use of an averaging strategy by observers would 
predominate, and it did, but when the averaging strategy was not used, 
other simplifying strategies raised the baseline of choices to such a 
high level that a ceiling effect became operative and higher scores 
became impossible. Thus, the large number of ties. This ceiling effect 
really precludes any firm conclusions based on these data. Currently, 
this study is being replicated with a different set of event con- 
figurations, selected in light of the data described. 

A conceptual analysis of the several meanings of "reliability" and 
"unreliability" and of the ways in which the construct can be manipu- 
lated and measured is needed. In some ways, the perceived reliability 
of information may well lie at the core of some problems in human 
inference and information processing. Whether to throw out the data or
change one's opinion based on the data ought to be a central issue,
both in prediction research and in research in which the data are used
as feedback for predictions. This experiment simply did not get at the
limited aspect of the larger issue as intended. Other research, related
to the whole thrust of this program, does suggest an important
hypothesis. It may well be that a crucial aspect of error in the data
(i.e., unreliability) is whether the subject perceives it to be error in
a measurement error sense (close but not exact) or perceives it to be
error in an all-or-none sense. The latter means that the subject
perceives the information to be probably exactly correct, but that if it
is not exactly correct, it is "garbage." The latter form of perceived
error has much more influence on subjects. A similar distinction will
be drawn later in the discussion of the distributions of errors in
subjects' responses and the impact of those error distributions on the
experimenters' interpretation of the data.

V. Study 3. Time Pressure as a Factor in Strategy Use 

Another variable considered to be of major importance in assessing 
decision making behavior is the effect of time pressure, i.e., limited 
time for information processing. 

As compared with the situation in which the decision maker has vir- 
tually unlimited time for information processing, what will be the 
effect of reducing this processing time to some interval below the 
amount of time normally used? Will strategies change? Will the 
reliability of the decisions decrease? Will information be discounted 
(i.e., ignored, or receive reduced consideration) with shorter available 
processing time? 

A new inventory of 102 problems was constructed. This new inventory 
was similar to the JSH problem set but was made up of special types of 
items for answering the questions posed. The display and response 
equipment were the same as in Study 2 with, of course, different soft- 
ware. Subjects were 22 junior and senior biology and psychology stu- 
dents who were paid for their participation. Each subject participated 
individually in two sessions 1 week apart. In each session, the set of 
102 problems was presented in a random order followed by the same set in 
a different random order. Half of the subjects received the self-paced 
treatment first, followed a week later by the time pressure session (4 
seconds for responding). The other half responded under time pressure
first, with the self-paced treatment a week later.

Table 5 shows the mean reliabilities and mean number of non-responses
for both groups. As in Study 2, subject reliability, or consistency
of subjects' responses, is the proportion of identical
responses between first and second presentations of the same stimuli 
within the same treatment. In the present study though, the proportions 
are based only on those arrays to which subjects responded on both occa- 
sions. In other words, an array which drew a non-response on one pre- 
sentation but was responded to on the other was not counted. 
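
The consistency index just described can be sketched as follows. This is a minimal illustration, not the analysis program used in the study: the function name, the use of `None` to mark a non-response, and the sample responses are all assumptions, and the two response lists are assumed to have been re-aligned so that position i refers to the same array in both presentations.

```python
def paired_consistency(first, second):
    """Return (proportion of identical responses, number of usable pairs).

    `None` marks a non-response. An array is dropped if either
    presentation drew a non-response, as described in the text.
    """
    usable = [
        (a, b) for a, b in zip(first, second)
        if a is not None and b is not None
    ]
    if not usable:
        return None, 0
    same = sum(a == b for a, b in usable)
    return same / len(usable), len(usable)


# Invented responses ("L" = left event, "R" = right event) to six arrays.
first_presentation = ["L", "R", None, "L", "R", "R"]
second_presentation = ["L", "L", "R", "L", None, "R"]

consistency, n_used = paired_consistency(first_presentation, second_presentation)
print(consistency, n_used)
```

Because dropped arrays leave the denominator, a subject with many non-responses has a consistency value based on fewer comparisons, which is worth keeping in mind when comparing the two treatment orders.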

The number of non-responses in the pressure-first treatment is more
than 2.5 times the number of non-responses when pressure appeared in
the second session. This implies that practice in a non-pressure regi- 
men may have familiarized subjects sufficiently with the task to enable 
them to process information rapidly and to reduce, markedly, the number 
of non-responses with the restricted response time. 

The overall reliability for the self-paced treatment was .73 and for
the time pressure treatment, .71. For the group which responded under 
time pressure first, reliability was .65 under time pressure and .74 
when self-paced. For the self-paced-first group, reliability was .73 
when self-paced and .77 under time pressure. 

What about strategies? All the items in this study were selected 
such that the arithmetic means were identical for each pair of events to 
be compared. For example, one event might have cues of .20, .60, and 
.80 and be paired with an event with cues of .40, .50, .70. Another 
event pair might have cues of .30, .60, .70, .80 for one event and .40, 
.50, .70, .80 for the other. Others might be .50, .60, .70 compared 
with .40, .80; or .20, .30, .70 compared with .40. In other words, an 
Averaging strategy using all the information available would not discri- 
minate between the likelihoods of the event pairs. 
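
As a check on the examples just given, the short sketch below confirms that each quoted pair of cue sets has the same arithmetic mean. Cue values are written in percent, and exact fractions are used so that the comparison is not affected by rounding; the variable and function names are illustrative only.

```python
from fractions import Fraction

# Event pairs quoted in the text, with cue values in percent.
pairs = [
    ([20, 60, 80], [40, 50, 70]),
    ([30, 60, 70, 80], [40, 50, 70, 80]),
    ([50, 60, 70], [40, 80]),
    ([20, 30, 70], [40]),
]


def mean(cues):
    """Exact arithmetic mean of a cue set."""
    return Fraction(sum(cues), len(cues))


for left, right in pairs:
    print(left, right, "equal means:", mean(left) == mean(right),
          "sums:", sum(left), sum(right))
```

The means agree in every pair, while the sums differ whenever the two sets contain different numbers of cues, which is why a strategy based on sums can still discriminate between such pairs.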

Table 5. Mean Reliabilities and Mean Numbers of Missing Responses

                                  Self Paced    Time Pressure

Both Groups (n = 22)
    Mean Reliability                 .73            .71
    Missing Responses                0             9.30

Time Pressure First (n = 11)
    Mean Reliability                 .74            .65
    Missing Responses                0            13.50

Self Paced First (n = 11)
    Mean Reliability                 .73            .77
    Missing Responses                0             5.20

Three strategies were defined in the following ways:

1. In an Outlier strategy, subjects could have discounted an 
outlier in either or both of the event sets of cues and made a judgment 
of greater likelihood of occurrence with only the remaining information.
This is essentially the complete discounting of any outlier.

2. In a Cluster strategy, subjects showed preferences for cue sets 
with lower variability; for example, .40, .50, .60 as compared with .20, 
.50, .80. Only cue sets with three or more cues in each set were used 
for this analysis. 

3. In an Adding strategy, subjects could choose the cue set that
had the higher sum. This is perfectly confounded with a Most Cues stra- 
tegy, when the average is held constant, but since Study 1 showed that a 
Most Cues strategy was virtually never used, and that an Adding strategy 
was next most likely to be used (compared to Averaging), it seems more 
parsimonious to call this an Adding strategy. 
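
The three strategy definitions can be expressed as simple choice rules. The sketch below is one possible reading, not the scoring procedure used in the study. In particular, treating a cue as an outlier when it lies 30 points or more from its nearest neighbor in the same set, leaving sets of fewer than three cues untouched, and all function names are assumptions introduced for illustration. Cue values are in percent.

```python
from statistics import pvariance

OUTLIER_GAP = 30  # assumed threshold, in percent


def drop_outliers(cues, gap=OUTLIER_GAP):
    """Remove cues lying `gap` or more from their nearest neighbor."""
    cues = list(cues)
    if len(cues) < 3:  # assumption: too few cues to define an outlier
        return cues
    kept = [
        cue for i, cue in enumerate(cues)
        if min(abs(cue - other) for other in cues[:i] + cues[i + 1:]) < gap
    ]
    return kept or cues  # if every cue is isolated, keep the set as is


def prefer(left_score, right_score):
    """Name the side with the higher score, or report a tie."""
    if left_score > right_score:
        return "left"
    if right_score > left_score:
        return "right"
    return "tie"


def mean(cues):
    return sum(cues) / len(cues)


def outlier_choice(left, right):
    """Outlier strategy: discount outliers, then compare the means."""
    return prefer(mean(drop_outliers(left)), mean(drop_outliers(right)))


def cluster_choice(left, right):
    """Cluster strategy: prefer the set with lower variability.

    The report applied this only to sets with three or more cues each;
    that restriction is not enforced here.
    """
    return prefer(-pvariance(left), -pvariance(right))


def adding_choice(left, right):
    """Adding strategy: prefer the set with the higher sum."""
    return prefer(sum(left), sum(right))


left, right = [20, 60, 80], [40, 50, 70]
print(outlier_choice(left, right), cluster_choice(left, right),
      adding_choice(left, right))
print(adding_choice([50, 60, 70], [40, 80]))
```

For the pair 20-60-80 versus 40-50-70, the rules as sketched disagree: discounting the outlier favors the first set, the clustering rule favors the second, and the sums are tied. A single response can therefore be consistent with one strategy and inconsistent with another.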

Table 6 shows the overall percentages of all responses to those 
pairs of events amenable to the three strategy analyses that could have 
been used by the 22 subjects. The table also shows the number of sub- 
jects who used these strategies. Strategy use simply implies a response 
that is consistent with that type of strategy but does not establish 
that the actual cognitive operations concomitant with that strategy were 
used. 

Use of one or more of the three strategies listed in Table 6 does
not preclude use of the other strategies since some inventory items
could be evaluated according to more than one of the three strategies.
Additionally, subjects could use different strategies at different
times, that is, a mix of strategies. Basically, what the table shows is
heaviest "use" of an Adding strategy followed by an Outlier strategy
when an Averaging strategy is non-diagnostic.

Table 6. Strategy Utilization

a. Mean Percentages of Strategy "Utilization"

Strategy      Self Paced    Time Pressure

Outlier          62%            60%
Cluster          SOt            UQ%
Adding           Q2%            83%

b. Numbers of Subjects "Using" These Strategies a Significant Amount

Strategy      Self Paced    Time Pressure

Outlier          l£             17
Cluster          7              13
Adding           21             20

The most interesting result of this study is the facilitating effect 
of the early, self-paced trials on later time-pressured responses. 
Apparently, even at the level of formation and use of simple information 
processing strategies, a considerable amount of time may be required to 
form the integration rule to be used, and the "performance" aspect of 
that rule application may take a fair amount of practice. 

The Discounting and Cluster strategies, which are similar, did seem 
to be used by subjects, but the predominant strategy seemed to be just 
to go with the most votes, or with the biggest sum. When one takes away 
the strategy of choice of most subjects, i.e., Averaging, they seem to 
dip into a bag of strategy tricks and to use what seems appropriate to 
the situation at hand, rather than revert to a second most favored stra- 
tegy which they then stick with over all problems. The technique of 
taking away the favored strategy, or rather of precluding its use by 
making it irrelevant, seems to be a potentially useful one to permit the 
exploration of the set of strategies subjects actually have available. 
That set may be a large one, in spite of the quite common use of 
Averaging as a simplifying strategy.

VI. Study 4. Information Use as a Function of Cue 
Distribution Variables 

The fourth study in this sequence looked at an observer's estimates 
of overall likelihood without making a decision as to which of two 
events had a greater probability of occurrence. A single linear display 
of probabilities was shown on each trial, and the subject simply indi- 
cated a "likelihood of occurrence" by moving a cursor above the display 
to a point that indicated this likelihood. Figure 3 shows this type of 
display with some sample problems.

Figure 3. Four sample problems. Problems A and C have low
outliers; problems B and D have high outliers. Problems A
and B have symmetric clusters; problems C and D have
asymmetric clusters. All four problems have an outlier distance
of .3 and a density of 3.

This task is an abstract analogue of a real-world situation. A person
who has several sources of information bearing on the possible
occurrence of a single future event is asked to estimate the overall
probability of the occurrence of that event. As in the previous
studies, the information is presented in the form of probabilities.

Four variables were manipulated systematically to analyze not only 
their individual effects on this task but their possible combined, or 
joint, effect. One variable was the probabilistic distance (either .20, 
.30, or .40) between an outlying bit of information and the rest of the 
clustered information. The second variable was the direction from the 
cluster in which the outlying information was to be found (either above 
or below, i.e., higher than or lower than the cluster on the probability 
scale). Variable three was the density of the clustered information — 
either two, three, or four pieces of information located within the same 
short range. Variable four, symmetry, placed the clustered information 
(not the outlier) either directly in the middle of the display or 
slightly to the left (lower) or right (higher) part of the display.

Subjects were run individually and were self-paced. Each subject 
received five presentations of each of the 36 stimulus configurations: 
(3 levels of variable 1) x (2 levels of variable 2) x (3 levels of
variable 3) x (2 levels of variable 4) = 36.
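
The factorial structure of the stimulus set can be enumerated directly. The sketch below is illustrative: the level labels for direction and symmetry are names chosen here, and the order in which trials were actually presented is not specified in the text.

```python
from itertools import product

distances = (0.20, 0.30, 0.40)             # variable 1: outlier distance
directions = ("below", "above")            # variable 2: outlier direction
densities = (2, 3, 4)                      # variable 3: cues in the cluster
symmetries = ("symmetric", "asymmetric")   # variable 4: cluster placement

# Every combination of levels is one stimulus configuration.
configurations = list(product(distances, directions, densities, symmetries))

PRESENTATIONS_PER_CONFIGURATION = 5
trials = configurations * PRESENTATIONS_PER_CONFIGURATION

print(len(configurations), len(trials))
```

With five presentations of each of the 36 configurations, each subject made 180 likelihood estimates.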

Of the 20 subjects, two did not understand the task and were elimi- 
nated from any analysis. Of the remaining 18, two more were set aside 
because of an unusually strong tendency to use only the probability of 
the highest cue value. Inclusion of the data of these latter two
subjects changes no conclusions and affects no significance tests.
The results for the remaining 16 subjects are summarized in Table 7. 

Three of the four variables showed statistically reliable effects. 

Table 7. Marginal Means of Directional Deviation Scores
for Reduced Sample (n = 16)

Outlier Direction
    Low           X = -.00514    σ = .03673
    High          X = -.00116    σ = .03575

Symmetry
    Asymmetric    X =  .00136    σ = .03497
    Symmetric     X = -.00765    σ = .03703

Density
    2             X = -.01051    σ = .02854

Outlier Distance
    .2            X =  .00353    σ = .02487

Remaining Density and Outlier Distance levels (labels not legible)
                  X = -.00033    σ = .03607
                  X =  .00363    σ = .03495
                  X =  .00140    σ = .04196
                  X = 00934      σ = .04512

First, as the outlier distance increased from .20 to .30 to .40 units
away from the nearest other datum, the outlier's effect shifted from 
reducing the estimated likelihood for low outliers and increasing the 
estimated likelihood for high outliers, respectively, to a discounting 
effect which increased the estimated likelihood for low outliers and 
decreased the estimated likelihood for high outliers. Second, as the 
density of the non-outlying cues changed from two to three to four, the 
outlier had an increased effect in terms of changing the estimated like- 
lihood of the to-be-predicted event in the direction of the outlier, 
away from the cluster. In other words, it appeared that as the
increased density of the non-outliers made the cluster more pronounced,
the observers weighted the outlier more heavily in estimating overall
likelihoods. This effect was contrary to what had been
expected from the verbal reports of many subjects in earlier studies.
Third, symmetric arrangements of displays, i.e., clusters centered about 
.50, produced average likelihood estimates toward the extreme end of the 
cluster (away from the outlier) as compared with an arithmetically 
calculated average. Asymmetric arrangements gave mean likelihood esti- 
mates away from the arithmetic average in the direction of the outlier. 
Direction of the outlier from the clustered information showed no sta- 
tistically reliable effect on the absolute difference scores, nor did 
the interactions of any of the set of four independent variables. 

In this study, a clear outlier discounting effect emerged, but the
outlier must be relatively extreme before the discounting occurs.
Since, in Study 3, outliers were defined as being .30 unit away from the
nearest datum favoring the same event, the use of the outlier
Discounting strategy in Study 3 may have been attenuated. Discounting
is probably, based on the data of Study 4, a potent phenomenon, provided
that the outlying datum is actually extreme.

The failure to find a directionality effect is somewhat surprising, 
given the general finding that negative information is more salient than 
positive information. It is not certain, though, that all subjects pro- 
perly conceived low probabilities as negative information, or as infor- 
mation that was to be taken as evidence against the occurrence of the 
event. 

The density effect is paradoxical. Outlying data are discounted 
because these sources are too different from other more coherent sour- 
ces. This implies that as data in the clusters get tighter, subjects 
should perceive such data as more reliable. They do not. Any explana- 
tion at this juncture would be purely ad hoc, so it will be left as a 
paradox. 

VII. Study 5. Presentation Mode and Allocated Processing 
Time as Factors in Estimating Numerical Averages 

How information is displayed to an observer is the last major 
variable to be evaluated in this series. In all studies so far, the 
information presented has been on a geometric numeric (GN) scale. That 
is, probabilities have been indicated on a scale in which equal distan- 
ces between equally different probabilities were reproduced geometri- 
cally (see Figure 4). Another scale, or list, which shows only the 
values of the probabilities in list form (LF) without geometrical repre- 
sentation is also presented in Figure 4. A third representation is made 
up of the familiar histogram bars (HB) where the heights (and areas) of 
the bars present the information. 

The question now is: If the quantitative values of the information
are the same in all presentation modes, will responses also be the same?
If not, how will they differ?

Figure 4. An example of information in the Histogram Bars (HB),
Geometric Numeric (GN), and List Form (LF) displays.

The display formats were presented to subjects with three different
times for processing the information and for two different amounts of
information.

This study used 15 undergraduate subjects all of whom made judgments
(a) of cue sets composed of three or five sources of information, (b)
with presentation times of 3, 6, or 9 seconds, and (c) with all three
types of displays, GN, LF, HB. The judgment to be made was a simple
average of the three or five cue values shown on the video screen. The
response was written on paper by the subject during a 4-second
intertrial interval.

Figure 5 gives a summary of the results of this study in several 
different ways for ease in assessing the effects of the three indepen- 
dent variables and their interactions. In Figure 5 note that the 
measures are the averages of the absolute values of the differences bet- 
ween the arithmetically correct average and the subjects' estimated 
averages, i.e., the average absolute error per trial. 
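
The error measure can be sketched as below. The cue sets and estimates are invented for illustration and are not data from the study; the function name is likewise an assumption.

```python
def mean_absolute_error(correct_averages, estimates):
    """Average absolute difference between correct and estimated averages."""
    if len(correct_averages) != len(estimates) or not estimates:
        raise ValueError("need equally many correct averages and estimates")
    total = sum(abs(c - e) for c, e in zip(correct_averages, estimates))
    return total / len(estimates)


# Invented trials: two three-cue sets and one five-cue set.
cue_sets = [[20, 50, 80], [10, 30, 40, 60, 90], [35, 45, 70]]
estimates = [55, 40, 50]  # invented subject responses

correct_averages = [sum(cues) / len(cues) for cues in cue_sets]
error = mean_absolute_error(correct_averages, estimates)
print(round(error, 2))
```

Because absolute values are taken before averaging, overestimates and underestimates do not cancel, so the measure reflects the size of the errors rather than their direction.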

The three top graphs show the main effects of the three independent 
variables. The different formats (HB, GN, LF) all give about the same 
average error overall, with the HB displays showing the highest pro- 
cessing error (5.91), the GN displays showing the least error (4.92), 
and the LF in between (5.07). Although the differences are small, they 
are statistically reliable. 

The different display times show superior accuracy for the 9-second 
processing time, followed by the 6-second time, with the 3-second time 
allocated to processing the least accurate. These average errors are 
3.86, 4.69, and 7.35, respectively. These differences are also
reliable. 

Finally, as expected, the error for five cues is reliably larger
than the error for three cues: 7.10 compared with 3.50.

Figure 5. Row 1 shows the significant effects for each of the three main
factors. Row 2 shows the three non-statistically significant two-factor
interactions. Row 3 shows the two graphs which together allow
interpretation of the non-statistically significant three-factor
interaction.

Although the results of this study are quite orderly, there is an 
additional issue to be considered. Of the 15 subjects in the analyses, 
four did best with histograms, five with geometric numeric information, 
and six with the simple numeric list. 

The differences for processing time and amount of information serve 
primarily to show that the results are, in fact, orderly. The relative 
superiority of the GN over the HB looks, at first glance, like a stimulus 
response compatibility effect, since both stimuli and responses are in 
numeric form, except that the GN display is also superior to the LF. It 
seems that even with a task as simple as averaging three or five digits, 
the spatial representation of the metric relations carried in the GN 
display enables better performance. It may be that the geometric 
displays facilitated some sort of error-checking routine, or made some 
sort of intuitive (rather than analytical or algorithmic) processing 
more likely. In either event, different error distributions would be 
expected in the GN than in the LF displays, a hypothesis that has not 
yet been assessed. In fact, error distributions may eventually be of 
considerable interest. Two displays could have the same average error,
with one having many small errors but very few exactly correct
responses, and with another having very few errors, but large ones. This
difference in error distributions is precisely what Brunswik predicted
and found when he contrasted intuitive and analytical thought, or
perception and reasoning (Hammond, 1966). A difference of this sort
could, of course, have profound implications for many activities of
concern to the Air Force. The rare but very large judgmental error is
almost certainly a far more serious problem than is the frequent but
small error.
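
The point that equal average errors can conceal very different error distributions is easy to demonstrate. The two sets of absolute errors below are invented for illustration and are not taken from the study; the summary statistics chosen are one possible way to describe the contrast.

```python
# Invented absolute errors over ten trials for two hypothetical displays.
many_small_errors = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
few_large_errors = [0, 0, 0, 0, 0, 0, 0, 0, 0, 20]


def summarize(errors):
    """Describe an error distribution beyond its average."""
    n = len(errors)
    return {
        "mean_abs_error": sum(errors) / n,
        "proportion_exact": sum(e == 0 for e in errors) / n,
        "largest_error": max(errors),
    }


small = summarize(many_small_errors)
large = summarize(few_large_errors)
print(small)
print(large)
```

Both hypothetical displays have a mean absolute error of 2, but one never produces an exactly correct response and the other is exact on nine trials in ten with a single error of 20. The average alone cannot distinguish them.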

The apparently (but not actually) anomalous degree of individual
differences reflected in the study makes any generalizations from these
data tenuous. While the small number of subjects and the within-subjects
design (with the consequent difficulty of interpretation) preclude
unqualified
generalizations, the superiority of the GN display is most interesting 
and very likely is not situation specific. Essentially, the GN display 
seems to embody the stimulus characteristics that elicit the best 
features of both intuitive and analytical modes of thought. 

VIII. Study 6. Two Additional Investigations 

The next two experiments were conducted in addition to the original 
effort. Both studies further examined the effects of unreliability in a 
context different from those already described. These experiments used 
a paradigm developed by Doherty, Mynatt, Tweney, and Schiavo (1979) 
which showed that people tend to seek and use diagnostically worthless 
information when diagnostically valuable information is easily 
available. This experimental paradigm is called pseudodiagnosticity.

IX. The Bayesian and Pseudodiagnosticity Paradigms

Assume someone has one of two, and only two, possible diseases.
Assume that a probability can be assigned to disease A, P(A), and a 
complementary probability to disease B, P(B), where P(B) = 1 - P(A). 
These probabilities are called prior probabilities, or simply, priors. 

Assume now, subsequent to the assignment of these priors, two symp- 
toms appear, X and Y, both of which have a known relationship to each of 
the diseases A and B. All this information is shown in Table 8. 

An example of the information described in Table 8 might be the 
following illustration. 

Table 8. Probabilities Utilized in Simplified Paradigm

                 Disease A      Disease B
                 P(A)           P(B)

Symptom X        P(X/A)         P(X/B)
                 P(X̄/A)         P(X̄/B)

Symptom Y        P(Y/A)         P(Y/B)
                 P(Ȳ/A)         P(Ȳ/B)

where

P(A)    is the prior probability of disease A
P(B)    is the prior probability of disease B
P(X/A)  is the probability of symptom X occurring given that disease A is present
P(X̄/A)  is the probability that symptom X does not occur given that disease A is present
P(Y/A)  is the probability of symptom Y occurring given that disease A is present
P(Ȳ/A)  is the probability that symptom Y does not occur given that disease A is present
P(X/B)  is the probability of symptom X occurring given that disease B is present
P(X̄/B)  is the probability that symptom X does not occur given that disease B is present
P(Y/B)  is the probability of symptom Y occurring given that disease B is present
P(Ȳ/B)  is the probability that symptom Y does not occur given that disease B is present

                                     Disease A     Disease B
                                     P(A) = .50    P(B) = .50

Row 1   Small Body Rash                 .65           .25
        Absence of Rash                 .35           .75

Row 2   Elevated Heart Rate             .80           .40
        Absence of Elevated
        Heart Rate                      .20           .60



Now, suppose that a diagnostician can request the information to be 
found in any two of the cells shown in the illustration for purposes of 
correctly diagnosing the disease. From which two cells should this 
information be chosen to enhance the chances of a correct diagnosis? 

The correct answer is a straightforward application of Bayes' 
Theorem. Either the two cells in row 1 or the two cells in row 2 should 
be selected since corresponding information for both diseases must be 
obtained. Any other pair of cells provide worthless information. 

The appropriate calculations for the selection of row 1 information
are

$$P(A/\text{Rash}) = \frac{P(A)\,P(\text{Rash}/A)}{P(\text{Rash}/A)\,P(A) + P(\text{Rash}/B)\,P(B)} = \frac{(.50)(.65)}{(.65)(.50) + (.25)(.50)} = .72$$

and

$$P(B/\text{Rash}) = .28$$

For row 2

$$P(A/\text{Heart}) = \frac{P(A)\,P(\text{Heart}/A)}{P(\text{Heart}/A)\,P(A) + P(\text{Heart}/B)\,P(B)} = \frac{(.50)(.80)}{(.80)(.50) + (.40)(.50)} = .67$$

and

$$P(B/\text{Heart}) = .33$$

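
The row 1 and row 2 calculations can be checked with a few lines of
code. The following is only a minimal sketch, assuming two mutually
exclusive and exhaustive diseases; the function name posterior and its
argument names are illustrative and do not come from the report.

```python
def posterior(prior_a, prior_b, like_a, like_b):
    """Apply Bayes' Theorem to one datum under two exhaustive diseases.

    like_a and like_b are the two cells of a single row, e.g.
    P(Rash/A) and P(Rash/B). Returns (P(A/datum), P(B/datum)).
    """
    joint_a = prior_a * like_a
    joint_b = prior_b * like_b
    total = joint_a + joint_b
    return joint_a / total, joint_b / total


# Row 1: Small Body Rash; Row 2: Elevated Heart Rate.
rash = posterior(0.50, 0.50, 0.65, 0.25)
heart = posterior(0.50, 0.50, 0.80, 0.40)

print(round(rash[0], 2), round(rash[1], 2))    # 0.72 0.28
print(round(heart[0], 2), round(heart[1], 2))  # 0.67 0.33
```

Both cells of a row are required because the denominator sums over
both diseases; a single column supplies only one of the two terms.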
What do people do when confronted with problems like this in
laboratory situations? Several published and unpublished studies have
shown that undergraduates, graduate students in business
administration, and medical residents, among others, ask for the wrong
data most of the time. Medical residents tend to want "confirmatory"
information and call for the data in columns. University
undergraduates generally prefer information contained in a diagonal.
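
The three kinds of cell pairs mentioned above (rows, columns, and
diagonals) can be told apart mechanically. The sketch below is
illustrative only: it assumes each cell is identified by a
hypothetical (symptom, disease) label, and the function name
classify_pair is not taken from the report.

```python
def classify_pair(cell_1, cell_2):
    """Classify two distinct cells, each given as a (symptom, disease) tuple."""
    if cell_1 == cell_2:
        raise ValueError("the two cells must be different")
    (symptom_1, disease_1), (symptom_2, disease_2) = cell_1, cell_2
    if symptom_1 == symptom_2:
        return "row"       # same symptom under both diseases: diagnostic
    if disease_1 == disease_2:
        return "column"    # one disease, two symptoms: not diagnostic
    return "diagonal"      # different symptom and disease: not diagnostic


print(classify_pair(("rash", "A"), ("rash", "B")))   # row
print(classify_pair(("rash", "A"), ("heart", "A")))  # column
print(classify_pair(("rash", "A"), ("heart", "B")))  # diagonal
```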

Do people learn to ask for the correct information? Doherty,
Schiavo, Mynatt, and Tweney (1981), using similar problems, had
subjects select data and make decisions as in the disease problem
described. These subjects then were given feedback as to whether their
decisions were consistent (correct or incorrect) when compared with the
probabilistic model. Additionally, half of the subjects were then given
a third cell of the information matrix to guarantee that they would see
a properly diagnostic pair of cells. These subjects were again asked to
do several more problems.

The results were clear. When subjects selected the wrong data but 
made the right choice, they continued to ask for the wrong data. When 
subjects selected the wrong data, made the wrong choice, and were not 
given the third bit of information, they also continued to ask for the 
wrong data. Only when subjects selected the wrong data, made the wrong 
choice, and were given the third unit of information did they shift to 
the optimal data selection strategy. 

In the study just described, feedback as to the correctness or
incorrectness of the diagnostic decision was always perfectly reliable.
That is, if the arithmetically calculated correct probability of choice
A was greater than the probability of choice B, the subject was always
told that A was the correct choice. Even though this procedure
guarantees maximum diagnostic performance, the real world is not
structured that way. People often make the best choice possible but
find out that they were wrong. How often? In the Bayesian framework,
the proportion of times is (1 - P), where P is the probability of an
event after the new information has been taken into account.

One of the focal questions of concern in this effort was the impact 
of unreliability. Specifically, the issue of unreliability in feedback 
was investigated using the pseudodiagnosticity paradigm. Essentially, 
this involved providing feedback according to the actual probabilities 
that a diagnosis would be correct or incorrect. Thus, in one treatment, 
the subjects were provided feedback that was fairly typical of psycholo- 
gical laboratories and class demonstrations: i.e., the feedback was 
always of the right sort. If the subjects made the optimal choice of 
data, they were virtually certain to make the best (most probable) deci- 
sion, and then the artificial environment would tell them they were 
right. The other treatment provided feedback much more like that 
occurring in the real world. That is, the feedback itself had the 
character of being uncertain, of being predictable only in a probabi- 
listic sense. 

X. Study 6. Reliable and Unreliable Feedback 

All subjects, 29 students in a nursing program taking a course in
statistics, were run individually using the video display and computer
keyboard described earlier. Subjects were randomly assigned to one of
two experimental treatments. One treatment provided feedback to
subjects after their responses according to the rule that if the
probability of the chosen disease given the two symptoms is greater
than .50, P(D/S1, S2) > .50, then the response is called correct 100%
of the time. This is feedback F_r. The second treatment provided
feedback after responses according to the rule that if the probability
of the chosen disease given the two symptoms is P(D/S1, S2), whatever
its value, then the response is called correct P(D/S1, S2) proportion
of the time. In other words, the feedback for correctness or
incorrectness of the choice was randomly determined according to the
arithmetically correct proportions of times disease A and disease B
occurred. Thus, sometimes "correct" (i.e., most probable) diagnoses
would be labeled incorrect, and sometimes incorrect diagnoses would be
labeled correct. This is feedback F_u.
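
The two feedback rules can be sketched in code. This is a minimal
illustration under stated assumptions, not the program used in the
study: the function names, the seeded random generator, and the
example value P(D/S1, S2) = .72 are all introduced here for
demonstration.

```python
import random


def reliable_feedback(p_chosen):
    """Reliable rule: 'correct' whenever P(D/S1, S2) exceeds .50."""
    return p_chosen > 0.50


def unreliable_feedback(p_chosen, rng):
    """Unreliable rule: 'correct' on a proportion P(D/S1, S2) of trials."""
    return rng.random() < p_chosen


rng = random.Random(0)   # fixed seed so the sketch is repeatable
trials = 10000
p = 0.72                 # assumed posterior for the chosen disease

hits = sum(unreliable_feedback(p, rng) for _ in range(trials))
print(reliable_feedback(p))   # True on every trial under the reliable rule
print(hits / trials)          # close to .72 under the unreliable rule
```

Under the unreliable rule, the most probable diagnosis is labeled
incorrect on roughly (1 - P) of the trials, which is the proportion
given by the Bayesian framework.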

In both treatments all subjects saw the same displays and performed 
the same 40 diagnostic tasks, but 15 subjects received one kind of feed- 
back and 14 the other. The displays were like those already described 
in the 2 x 2 matrix except that all information was electronically 
masked at the outset. The computer then randomly selected one of the 
four cells for the pertinent diagnostic information and presented it on 
the video display. A second one of the four cells was then selected by 
the subject and this cell's information was also shown on the display. 

With these two probabilities, the subject made the diagnostic 
response, either disease A or disease B. Immediately thereafter, the 
display informed the subject of the correctness of the response 
according to scheduled feedback F_r or F_u and presented the entire 2 x 2 
array of diagnostic information for the subject to inspect. A new trial 
with new information was initiated at the subject's discretion by 
depressing the RETURN key. 

What were the results? The great majority of subjects never adopted
the statistically appropriate strategy. Only three subjects in the
F_r treatment switched to the appropriate strategy. Three started with
the appropriate strategy and never deviated from it. In the
F_u treatment, no one started with the appropriate strategy, three
adopted it and used it on at least seven consecutive trials, and one
abandoned it after having a rational choice identified as "wrong." The
small number of people adopting the appropriate strategy in the
F_r treatment deprived this study of an adequate baseline against which
to assess the possibly detrimental effects of unreliability in the
feedback. There was an overall positive effect of the independent
variable, F_r vs. F_u; i.e., in the F_r treatment, 44% of the choices
were the appropriate diagnostic type compared with 33% of the choices
in the F_u treatment.

Why didn't more people in the F_r treatment adopt or switch to the 
appropriate strategy? Probably because there was still a fairly high 
proportion of responses that received feedback as "correct" even when 
the strategy in arriving at the response was incorrect. As expected, 
when people are reinforced, they keep doing what they have been doing. 

XI. Study 7. Reliable and Unreliable Feedback II

With modified conditional probabilities relating symptoms to
diseases, 13 more subjects from a statistics course participated in an
experiment methodologically identical to Study 6. Again, subjects in
the F_r treatment did not really learn to adopt the appropriate
strategy for information acquisition. Nonetheless, it is important to
note that the results of these two experiments can be interpreted in a
very direct fashion.

The two experiments are strong replications of the Doherty et al.
(1979, 1981) findings and show the power of the pseudodiagnosticity
effect. Subjects with correct data available at the push of a button
are more likely to select and to use incorrect data when making a
diagnostic decision. This is even more surprising considering that
subjects in these two studies were in a highly selective nursing
program, had a substantial introduction to statistical thinking, and
worked through 40 experimental trials. Furthermore, on each trial,
after making a response and obtaining feedback, the subjects then saw
all of the potentially available information. They simply did not bring
the necessary cognitive operations to bear on the problem.

The study of the effect of error in the feedback, in a variety of
task environments, is potentially extremely important. It is a
universal human tendency to want performance feedback. But if
performance feedback of various sorts can be shown to be disruptive
when that feedback is sufficiently laced with error, then such feedback
probably ought to be withheld, no matter what the learner wants, at
least early in the learning process. The multiple cue probability
learning literature provides clear examples of situations in which
feedback disrupts both learning and performance. Given the widespread
belief that feedback is always a "good thing," given the clear power of
feedback from both reinforcement and informational standpoints, and
given what is assumed to be an incontrovertible fact that feedback in
the world is itself laden with uncertainty, investigation of the
effects of error in feedback seems critical.

The authors plan to pursue the problem with minor modifications of
the procedure used here. The essential modification to be made,
interestingly, is a change in the direction of greater
representativeness. That is, if a disease, or any other type of system
failure, does have more than one symptom, then those symptoms should
not be independent of one another in the real world. In the jargon of
probability theory they should be "conditionally dependent." In these
two studies, all P(S/D) values were made to be statistically
independent.

Introducing precisely the dependence that probably exists in the world 
should have the effect of making erroneous strategies of symptom choice 
less serendipitously informative. This should permit development of a 
baseline of good performance in an F_r treatment so it can be observed 
whether F_u has the disruptive effect predicted. 
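
The independence assumption used in these two studies can be made
concrete with a short sketch. It assumes that P(S1, S2/D) is the
product P(S1/D)P(S2/D), and it borrows the rash and heart rate values
from the earlier illustration purely as an example; they are not the
stimuli of Study 6 or Study 7, and the function name is hypothetical.

```python
def posterior_two_symptoms(prior_a, prior_b, s1_like, s2_like):
    """Posterior over two diseases given two symptoms.

    s1_like and s2_like are (P(S/A), P(S/B)) pairs. The symptoms are
    ASSUMED to be independent given the disease, so their likelihoods
    are multiplied. Returns (P(A/S1, S2), P(B/S1, S2)).
    """
    joint_a = prior_a * s1_like[0] * s2_like[0]
    joint_b = prior_b * s1_like[1] * s2_like[1]
    total = joint_a + joint_b
    return joint_a / total, joint_b / total


# Rash present and heart rate elevated, equal priors.
p_a, p_b = posterior_two_symptoms(0.50, 0.50, (0.65, 0.25), (0.80, 0.40))
print(round(p_a, 2), round(p_b, 2))   # 0.84 0.16
```

If the symptoms were dependent given the disease, the product in the
function would have to be replaced by the joint probability
P(S1, S2/D), which the sketch does not model.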

XII. Discussion 

First, a novel framework was used in which decision makers compared
evidence for the likelihood of one event with the evidence available
for the likelihood of another event, then chose which of the two events
was more likely to occur. The evidence for each event was in the form
of probabilities and was diagnostic (useful for predictive purposes)
for the occurrence of that, and only that, event. Then a comparison was
made of the use of some simple types of strategies among UPT pilots and
F-16 trainees. Few differences were found between these pilot types,
but there were strong indications of the predominant use of a type of
averaging strategy, a result that has been obtained in other multiple
cue probability processing situations.

The inference was that most pilots and pilot trainees averaged
information systematically with the varying numbers of cues tested.
Nevertheless, there was a small number of subjects whose behavior in
this primitive, and perhaps fundamental, cognitive task was completely
inconsistent with averaging. Some subjects clearly used one or another
of the alternative strategies that had been hypothesized. From these
data alone, it cannot be determined whether these unusual strategies
for processing information would generalize across other tasks, since
constraints on the pilots' time did not permit using these same pilots
in different tasks. If future research shows these differences to be
consistent across information processing tasks, then there may be
situations in which it would be desirable to pre-select individuals and
to compose groups for homogeneity of cognitive processes. Equally
important, there may be situations in which heterogeneity of thinking
style would be highly desirable.

In a series of laboratory studies with college students, two
variables considered to be important in real-life situations were
examined: (a) the reliabilities of the information sources (evidence)
used to predict the occurrences of events, and (b) the presence of
disparate information sources (outliers). It was found that the
reliabilities of the sources do have an effect on the choice of an
event inferred to be more likely to occur, but the results are not
always consistent with what would be expected. Additional research is
investigating this effect further. Outliers were found to have effects
both in making decisions and in estimating averages of probabilities,
and other variables were identified that are important in considering
these effects. These other variables included the overall symmetry of
the display of information, the magnitude of difference between the
outlier and the cluster, and the size of the clustered information.

This effort examined the effects of allocated time for processing
information in two studies. In one study, performance with a reduced
time of 4 seconds was compared with performance with essentially
unlimited time. Reliability of respondents' decisions changed, as did
certain types of strategy use. In the second study, accuracy in
estimating average probabilities of occurrence was greatest with 9
seconds of allocated processing time, followed by 6 and then 3 seconds.
This same study also examined the effects of three different types of
displays: a bar graph of probabilities, a scaled list of probabilities
with appropriate interval spacing, and a simple list of probabilities
consisting only of numerical information. Differences in accuracy using
the three types of displays were small, but responses for bar graphs
showed less accuracy than did the lists. Among the subjects, however,
an almost equal number of people performed best with each of the three
types of displays.

Two studies examined the way information is selected for making
decisions in a pseudodiagnosticity paradigm. With little or no
training, or even with a fair amount of practice, people consistently
ask for information that has little or no value from the standpoint of
making rationally correct decisions.